ABSTRACT

Sixty subjects participated in an experiment
involving estimation of the difficulty of items in a test of reasoning
ability. The estimates were to be given both according to
conventional conditions of magnitude estimation with a preassigned
comparison standard and according to a modified procedure of
magnitude estimation where the comparison standard was chosen
individually by the subjects themselves. The test itself was
administered to the subjects under standard conditions prior to the
estimation procedures. When comparing the two methods of estimation
used, a high correlation between estimates and a close correspondence
of the modified method of magnitude estimation to the methods of
ratio estimation and similarity estimation were noticed. A high
correlation (r = 0.90) between the rank order of items according to
perceived difficulty and the item sequence was found. Furthermore,
estimated difficulty could tentatively be described as a positively
accelerated function of standard scores corresponding to solution
frequencies. The relative increase of perceived difficulty was more
pronounced for subjects with a high performance score on the test
than for subjects with a poor performance score. Probable causes of
the results obtained as well as possible secondary factors affecting
the estimates of perceived difficulty are discussed. (Author)
PERCEIVED DIFFICULTY OF ITEMS IN A TEST OF REASONING ABILITY *



Bratfisch, O., Dornic, S., and Borg, G. Perceived difficulty
of items in a test of reasoning ability. Reports
from the Institute of Applied Psychology, the University
of Stockholm, 1972, No. 28. - Sixty subjects participated
in an experiment involving estimation of difficulty of
items in a test of reasoning ability. The estimates were
to be given both according to conventional conditions of
magnitude estimation with a preassigned comparison
standard and according to a modified procedure of
magnitude estimation where the comparison standard was
chosen individually by the subjects themselves. The test
itself was administered to the subjects under standard
conditions prior to the estimation procedures. When
comparing the two methods of estimation used, a high
correlation between estimates and a close correspondence
of the modified method of magnitude estimation to
the methods of ratio estimation and similarity estimation
were noticed. A high correlation (r = 0.90) between the
rank order of items according to perceived difficulty
and the item sequence in the test was found. Furthermore,
estimated difficulty could tentatively be described
as a positively accelerated function of standard scores
corresponding to solution frequencies. The relative
increase of perceived difficulty was more pronounced for
subjects with a high performance score on the test than
for subjects with a poor performance score. - Probable
causes of the results obtained as well as possible
secondary factors affecting the estimates of perceived
difficulty are discussed.



Introduction



It is typical of the measurement of intelligence that it usually 
starts from "objective" performance. In fact, performance scores
are commonly the basis both for determining the quantity of a person's 
intellectual capacity and for the analysis of the dimensionality of 
intellectual performance by means of correlational techniques. 



* This investigation was supported by a research grant to Professor
Gunnar Borg from the Swedish Council for Social Science Research
(Project number 439/71 P).
Surprisingly little attention, however, has been paid to the question of
how the content (dimensionality, quality) and difficulty (quantity) of
intellectual tasks are experienced by the performing persons themselves
and what the relation between such "subjective" and the above-named
"objective" measurements might be. As far as perceived quality is
concerned, only two studies are known to us (Bratfisch & Ekman, 1969;
Bratfisch, 1971), while perceived difficulty has, by comparison, been
studied to some extent. The majority of the latter type of studies were
concerned with the relation between the perceived difficulty of
intellectual tasks and its "objective" counterpart as based on performance
(Borg & Forsling, 1964, 1965, and 1967; Borg, 1966, 1968, and 1969;
Munz & Jacobs, 1971), while some others were primarily interested in
the possibility of increasing test reliability by using "subjective"
measurements (Backman & Wedman, 1971). In related studies the
relation between self-estimated effort and physical performance was
investigated (e.g. Borg, 1962).

The present investigation is once more concerned with the perceived
difficulty of intellectual tasks and can be regarded as a continuation of
Borg's study in 1969. In his study Borg used Sets A, B, D, and E from
Raven's well-known test "Standard Progressive Matrices" (Raven, 1960)
as stimulus material, i.e. altogether 48 tasks. The tasks were
administered to the subjects (34 students) in randomized order. After the
testing session they were asked to give their estimates of the difficulty
of the individual items using the method of magnitude estimation (see
e.g. Stevens, 1957). In contradistinction to the usual way of employing
this method, an "imagined standard" was used: the subjects were
instructed to call the "medium degree of difficulty" 10 and to estimate
the difficulty of the other tasks in relation to this kind of standard.

The method worked very well and the results showed a close
relationship between the item sequence according to estimated difficulty
and the rank order according to the tasks' position in the test. The
coefficients of correlation were 0.77, 0.89, 0.87, and 0.85 for Sets A,
B, D, and E, respectively. When perceived difficulty was plotted
against standard scores (z-values) corresponding to the solution
frequencies of the individual items, a linear relation was obtained. This
relation concerns, however, only 14 items of Sets D and E, for which
solution frequencies were available from a different group of 100
subjects. It was proposed that such a finding might be important for test
construction. It was pointed out that, particularly from the motivational
point of view, it might be better to rank the tasks of a test also with
reference to perceived difficulty and not only according to statistics based
on performance.



The experiments 



Reasoning ability plays an important role in all contemporary
research on intelligence based on correlational and related investigation.
Though a considerable variation exists as far as terminology and
theoretical basis are concerned, a far-reaching agreement with regard to
significance and meaning of the factor in a general sense can be noticed.
The measurement of reasoning ability, however, tends in practice to be
limited to one aspect of it - inductive reasoning ability. Typical tests
in this connection are "Number series" and "Matrices". Thurstone
would call the factor represented by such a test just "reasoning (R)"
(Thurstone, 1938), Guilford would name it "cognition of semantic
relations (CMR)" or "cognition of semantic systems (CMS)" (Guilford,
1967), while Meili would refer to it as "Komplexität (K)" (Meili, 1944).

As reasoning ability plays the central role in research on intelli- 
gence outlined above it was decided to pick a typical factor test of this 
kind for the study of perceived difficulty of intellectual tasks. 

The test "Matrices" used in the present experiments is a standard
test of the Institute of Applied Psychology regularly applied in connection
with vocational guidance. It consists of 24 items selected from the 60
items of the original Raven test. 10 of them belong to Set C, 8 belong
to Set D, and 6 to Set E. The 24 items are denoted by letters from the
Swedish alphabet. The test proper is preceded by 3 practice items
selected from Sets C and D.

Experimental conditions

In the first part of the experiments the test was administered to 
the subjects under standard conditions. In the second part they were 
asked to estimate the perceived difficulty of the individual items in re- 
lation to a standard item. 

Two different procedures were applied. In Procedure 1, an item
with a solution frequency close to 50 per cent (item "O" of the test,
corresponding to item C 10 in Raven's original test) was chosen as standard for
all the subjects and assigned the scale value "10". (Solution frequencies
were available beforehand from a group of 100 vocational guidance
clients of the Institute of Applied Psychology. These solution frequencies
and the corresponding z-values are shown in Table 1.) The subjects'
task was to estimate the difficulty of the remaining 23 items in relation
to the standard, using the method of magnitude estimation. The order
of items was randomized.
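
The conversion of a solution frequency into a standard score can be illustrated with a minimal sketch. It assumes that z is read from the unit normal distribution at the proportion of subjects who failed the item, so that harder items receive larger z-values; the report does not spell out the convention, and the function name and the example proportions are hypothetical, not the values of Table 1.

```python
from statistics import NormalDist


def difficulty_z(proportion_correct: float) -> float:
    """Standard score corresponding to the proportion of failures.

    Assumption: z is the unit-normal deviate of (1 - proportion correct),
    so items solved by fewer subjects get larger z-values.
    """
    if not 0.0 < proportion_correct < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - proportion_correct)


# Hypothetical solution frequencies, not the values of Table 1.
example = {"easy item": 0.95, "standard item": 0.50, "hard item": 0.10}
for name, p_correct in example.items():
    print(f"{name}: z = {difficulty_z(p_correct):+.2f}")
```

Under this convention an item solved by half of the reference group, such as the standard item of Procedure 1, lies at z = 0.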

Table 1 Proportions of correct answers (1-p) obtained from a group
of 100 vocational guidance clients and the corresponding
standard scores (z) for the individual test items.

In Procedure 2, the subjects were asked to choose their own
standard item, which was defined as the most difficult task. This item was
assigned the scale value "100". The difficulty of the remaining 23 items,
which again were arranged in randomized order, had to be estimated
in relation to this individual standard using once again the method of
magnitude estimation.

Medians, means and standard deviations of the experimental esti- 
mates according to both the above described procedures are shown in 
Table 2. 

Table 2 Medians (Mdn), means (M) and standard deviations (SD) of
the experimental estimates for each of the test items
according to Procedure 1 and Procedure 2.

Subjects

Altogether 60 subjects participated in the experiments, 35 of them
being students (undergraduates from the University of Stockholm) and
25 vocational guidance clients of the Institute of Applied Psychology.
The group consisted of 29 males and 31 females, ranging in age from
16 to 48 with a median age of 25.5.

The average performance of the whole group was 16.4 correctly
solved items, which is 68.3 per cent of all the items (24) in the test.
The maximum performance was 23 solved items (2 subjects), the
minimum 3 items (1 subject), the latter being an exception since the rest
of the subjects solved at least 10 tasks. Fifty per cent of the subjects
solved between 14 and 19 tasks.



Results

The main purpose of the present investigation was to study the
relation between item difficulty as perceived by the performing subjects
themselves and "objective" item difficulty based on performance.
However, before the results of the undertaken analysis are reported,
methodological questions concerning the two different procedures of
magnitude estimation used will be considered.

Comparison between scales 

Medians and means of the experimental estimates of the two
procedures used are plotted against each other in Figure 1A and
1B, respectively. No systematic deviations from the linear relationship
obtained can be noticed in either of the graphs. The Pearson coefficients
of correlation computed are +0.95 for the medians and +0.98 for the
means.

Fig. 1. Medians (Diagram A) and means (Diagram B) of the two
procedures as related to each other.
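
A product-moment coefficient of this kind can be computed as in the following minimal sketch. The two lists of per-item means are hypothetical stand-ins for the values of Table 2, which are not reproduced here, and the function name is chosen for illustration only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-item mean estimates (standard = 10 in Procedure 1,
# standard = 100 in Procedure 2); not the study's data.
procedure_1 = [4.0, 7.5, 10.0, 14.0, 21.0]
procedure_2 = [15.0, 30.0, 42.0, 60.0, 88.0]
print(round(pearson_r(procedure_1, procedure_2), 3))
```

Because the coefficient is invariant under linear rescaling, the different scale values of the two standards (10 and 100) do not in themselves lower the correlation.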

Another interesting question in this connection is the inter-individual
variability of the given estimates. This inter-subject dispersion is
shown in Figure 2A and B, where standard deviations and the
corresponding means have been plotted against each other for Procedure 1
and Procedure 2, respectively.




[Fig. 2 abscissae: mean estimates, Procedure 1 (Diagram A); mean estimates, Procedure 2 (Diagram B)]

Fig. 2. Standard deviations plotted against arithmetic means of
estimates. Diagram A shows data from Procedure 1, Diagram B
data from Procedure 2. The regression line in Diagram A was
fitted mathematically, the curve drawn in Diagram B was fitted
by eye.

As to Procedure 1, standard deviations can be said to grow
linearly with increasing means, though data are scattered around the
fitted regression line, as can be seen from Figure 2 A. The same kind
of linear relationship between means and standard deviations of
magnitude estimates has been found in other studies (e.g. Sjöberg, 1969).
The relation between standard deviations and means with regard to
Procedure 2, on the contrary, does not at all follow the above described
trend, as Figure 2 B shows. At first sight the inverse U-shape
of the trend in Figure 2 B does not fit the picture of magnitude
estimates; on second thoughts, however, this result is not at all surprising.

Let us take a closer look at the two procedures used. Procedure
1, on the one hand, follows conventional conditions, i.e. a stimulus
which is expected to lie approximately in the middle of the response
continuum is denoted "10" and used as a standard in relation to which
all the other stimuli are estimated. Procedure 2, on the other hand,
represents a modification of the conventional method of magnitude
estimation in so far that each subject is asked to point out the stimulus
which he experiences as the upper boundary of his response scale. This
boundary is then used as an individual standard and denoted "100"; the
estimates of all the other stimuli are given in relation to it. In this way,
Procedure 2 is basically of the same nature as the "distance" method
of ratio estimation (cf. Ekman & Sjöberg, 1965) - where the "bigger"
percept of a pair is always the standard to which the other percept is to
be compared - and the "content" method of similarity estimation (cf.
Ekman & Sjöberg, op. cit.) - where each pair of percepts is to be
estimated in relation to maximum similarity, i.e. to identity. Certainly,
more estimates are obtained when using the method of ratio estimation
or similarity estimation (each stimulus is compared with each other
stimulus), giving the methodologist an opportunity to control e.g. the
consistency of estimates; otherwise, however, there is no difference
between Procedure 2 and these two.



Given the basic correspondence of Procedure 2 to the
methods of ratio estimation and similarity estimation, the inverse
U-shape seen in Figure 2 B is quite reasonable. Inter-individual variation
has been found to follow an elliptic trend when plotted against means of
similarity estimates (Ekman & Künnapas, 1969), and the relation between
intra-individual variation and means of ratio estimation as well as
similarity estimation seems to be described by a parabolic arc (Eisler,
1960; Bratfisch & Ekman, 1969; Bratfisch, 1971). Similar results have
been obtained by Mashhour (1964). The analysis of the distributions of
the estimates of Procedure 2 also showed an accordance with earlier
findings, i.e. estimates tend to be skewed at both ends of the scale,
skewness being positive close to the lower boundary and negative close
to the upper boundary (see e.g. Ekman & Künnapas, op. cit.).

On account of the high correlation between scales, and as separate
analysis of the two scales also showed almost identical trends with
respect to their relations to "objective" difficulty, it was decided to use
averaged data for the further presentation of the results. When
averaging, medians were computed due to the skewed distributions mentioned
above.



"Objective" and perceived difficulty 

With the data available, the relation between "objective" and
perceived difficulty can be looked upon from two points of view. Perceived
difficulty may be plotted against the fixed order of items in the test, as
well as against z-values corresponding to the solution frequencies.

Figure 3 A shows medians of perceived difficulty plotted against the
order of items in the test. The close relationship between the two sets
of data is quite evident and is numerically confirmed by a Spearman
coefficient of 0.90. This result is in line with the result of Borg's
experiment in 1969.
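
For readers who wish to reproduce a rank-order coefficient of this kind, a minimal sketch follows. It assumes the usual definition of the Spearman coefficient as the product-moment correlation of ranks, with tied values given their average rank; how ties were treated in the original analysis is not stated. The data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: item position in the test and median estimated difficulty.
position = [1, 2, 3, 4, 5, 6]
median_estimate = [3.0, 5.0, 4.0, 9.0, 12.0, 11.0]
print(round(spearman_rho(position, median_estimate), 2))
```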



Fig. 3. Medians of estimates as related to the real order of items in
the test.
Figure 4 A shows medians of perceived difficulty as a function of
standard scores (z-values), corresponding to the proportion of correct
answers (1-p) seen in Table 1. The Spearman coefficient of correlation
between the two sets of data is again 0.90. Furthermore it seems that
perceived difficulty is growing as a slightly positively accelerated
function of standard score, the form of the trend being obscured by a
considerable scatter. To bring out the trend of the data more clearly, the
median estimates have been averaged for equal successive intervals of
the standard scores. The range of standard scores was divided into 5
equal intervals, the interval width being 0.74. The averaged data are
shown in Figure 4 B.
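
Five intervals of width 0.74 correspond to a total range of standard scores of 5 x 0.74 = 3.70. The averaging step can be sketched as follows, assuming that the estimates falling in each interval are combined by their arithmetic mean and that each interval is represented by its midpoint; neither detail is specified in the text, and the data are hypothetical.

```python
def average_by_intervals(z_values, estimates, n_intervals=5):
    """Mean estimate within each of n equal-width intervals of z.

    Returns (interval midpoint, mean estimate) pairs for non-empty intervals.
    """
    if len(z_values) != len(estimates) or not z_values:
        raise ValueError("need two non-empty sequences of equal length")
    lo, hi = min(z_values), max(z_values)
    if hi == lo:
        raise ValueError("z-values must span a non-zero range")
    width = (hi - lo) / n_intervals
    sums = [0.0] * n_intervals
    counts = [0] * n_intervals
    for z, est in zip(z_values, estimates):
        # The largest z-value is assigned to the last interval.
        k = min(int((z - lo) / width), n_intervals - 1)
        sums[k] += est
        counts[k] += 1
    return [
        (lo + (k + 0.5) * width, sums[k] / counts[k])
        for k in range(n_intervals)
        if counts[k]
    ]


# Hypothetical z-values spanning a range of 3.70 and hypothetical median estimates.
z = [-1.85, -1.20, -0.60, 0.00, 0.50, 1.10, 1.85]
medians = [3.0, 4.0, 6.0, 9.0, 11.0, 16.0, 25.0]
for midpoint, mean_estimate in average_by_intervals(z, medians):
    print(f"{midpoint:+.2f}  {mean_estimate:.1f}")
```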




[Fig. 4 abscissae: z-values corresponding to solution frequencies (Diagram A); classified z-values (S) (Diagram B)]

Fig. 4. Perceived difficulty as related to standard scores (z-values). In
Diagram A medians of estimates are plotted against standard
scores. Diagram B shows estimates averaged for successive
intervals. The curve drawn in Diagram B represents Equation (1).

From Figure 4 B it is seen that perceived difficulty grows with
increasing z-values corresponding to solution frequencies. The trend could
roughly be said to be linear, but a positively accelerated exponential
function of the form

R = a · b^S (1)

(where R denotes perceived difficulty and S z-values, while a and b are
empirical constants) may describe the trend to an even better
approximation. In two similar investigations (Borg & Forsling, 1964, 1965),
a linear relation between perceived difficulty and z-values was found,
the rank-order correlation between data being 0.90.
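
The constants a and b of Equation (1) can be estimated, for instance, by a least-squares fit of log R on S, since log R = log a + S log b is linear in S. The sketch below shows this approach under the assumption that all estimates are positive; the report does not say how the curve in Figure 4 B was obtained, so this is one possible method rather than the authors' own, and the data are synthetic.

```python
from math import exp, log


def fit_exponential(s_values, r_values):
    """Fit R = a * b**S by ordinary least squares on log R.

    Returns the pair (a, b). A value b > 1 means that R grows as a
    positively accelerated function of S.
    """
    if len(s_values) != len(r_values) or len(s_values) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    if any(r <= 0 for r in r_values):
        raise ValueError("all R values must be positive to take logarithms")
    n = len(s_values)
    log_r = [log(r) for r in r_values]
    mean_s = sum(s_values) / n
    mean_l = sum(log_r) / n
    sxx = sum((s - mean_s) ** 2 for s in s_values)
    if sxx == 0:
        raise ValueError("S values must not all be equal")
    slope = sum((s - mean_s) * (l - mean_l) for s, l in zip(s_values, log_r)) / sxx
    intercept = mean_l - slope * mean_s
    return exp(intercept), exp(slope)


# Synthetic data generated from a = 10, b = 1.5 (not the study's values).
s = [-2.0, -1.0, 0.0, 1.0, 2.0]
r = [10.0 * 1.5 ** value for value in s]
a, b = fit_exponential(s, r)
print(round(a, 3), round(b, 3))
```

Fitting on the logarithmic scale weights relative rather than absolute deviations, which may or may not be appropriate for averaged magnitude estimates.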

The next step in our analysis was to classify subjects into subgroups 
homogeneous with respect to sex, age, educational level, and perform- 
ance on the test. As far as subgroups according to sex, age, and edu- 
cational level are concerned data showed in all the above mentioned re- 
spects the same general trend as did the data for the group as a whole, 
and need not, thus, be presented. Performance level, however, seems 
to be of relevance for the estimation of item difficulty. Figure 5 A shows 
medians of perceived difficulty for the 20 subjects performing best on
the test and for the 20 subjects with the poorest performance on the test
plotted against the real item sequence in the test. An inspection of the 
diagram shows that the relative increase of perceived difficulty is higher 
for subjects with a high performance score on the test than for subjects 
with a low one. This tendency is seen more clearly when averaging the 
estimates of the two groups by taking 6 groups of 4 items following the 
item sequence in the test. 
Fig. 5. Perceived difficulty of two subgroups as related to item
sequence in the test. In Diagram A medians of estimates are
plotted against the real order of items in the test, Diagram B
shows averaged estimates.
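
The averaging over 6 groups of 4 consecutive items can be sketched as follows. The arithmetic mean is assumed as the averaging rule, the ratio of the last to the first block is used only as a crude illustration of "relative increase", and the two series of estimates are hypothetical.

```python
def average_in_blocks(values, block_size=4):
    """Mean of each consecutive block of `block_size` items."""
    if block_size <= 0 or len(values) % block_size != 0:
        raise ValueError("number of items must be a multiple of block_size")
    return [
        sum(values[i:i + block_size]) / block_size
        for i in range(0, len(values), block_size)
    ]


# Hypothetical median estimates for 24 items in test order (not the study's data).
best = [2, 2, 3, 3, 4, 5, 5, 6, 8, 9, 10, 12,
        14, 16, 18, 20, 24, 27, 30, 34, 40, 46, 52, 60]
worst = [4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12,
         13, 14, 15, 17, 18, 20, 22, 24, 26, 28, 31, 34]

for label, estimates in (("best", best), ("worst", worst)):
    blocks = average_in_blocks(estimates)
    # Ratio of the last block to the first as a crude index of relative increase.
    print(label, [round(b, 1) for b in blocks], round(blocks[-1] / blocks[0], 1))
```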

A further comparison of the subjects with the best performance
against those with the poorest performance showed Spearman coefficients
of correlation between the order of items according to estimated
difficulty and the order of items in the test of 0.94 for the "best" subjects
and of 0.86 for the "worst" ones. In Borg's experiment on Raven
matrices (Borg, 1969), a slightly higher correlation was found in the
"best" third of subjects than in the "worst" third of subjects, though the
difference was by far not so pronounced as in the present investigation.



Discussion 

The major findings obtained in the present investigation may be
summarized as follows. (1) A positively accelerated function can
tentatively be said to describe the relation between perceived difficulty and
z-values corresponding to solution frequencies. Though a high
correlation (0.90) was obtained between the rank order of items according
to perceived difficulty and real item sequence, as well as rank order
according to z-values, some items deviated markedly from the generally
high agreement between "objective" and "subjective" rank orders. (2)
The relative increase of perceived difficulty seems to be higher for
subjects with a high performance score on the test than for subjects
with a low one. (3) A high correlation between the two scaling procedures
used was noticed. However, the relation between standard deviations
and mean estimates of the modified method of magnitude estimation
found indicates that data obtained by this procedure have properties
corresponding to data usually obtained by the conventional methods of
ratio estimation and similarity estimation.

The major question arising from results (1) and (2) is what possible
implications they might have for test construction. Though result (2) might
be interesting from a general psychological point of view (which will be
discussed later), it would seem that it does not give us more information
than result (1), as the Spearman correlation between the subjective item
difficulty of the two groups is high (r = 0.96). This indicates that by and
large the same item difficulty sequence is experienced by both groups, this
sequence, in turn, being certainly closely related to the item-difficulty
sequence as experienced by the whole experimental group.

There is a general agreement among test authorities that the items
comprising a test should be arranged in order of increasing difficulty
defined by the "p" index or a similar derivative, though there is no
consistent agreement as to the rationale underlying the practice (cf. Lund,
1953; see also Munz & Jacobs, 1971). Results from the few studies on
this topic known to us indicate that an item arrangement from easy to hard
is superior for aptitude tests, i.e. yields higher test scores (Lund, op.
cit.; Sax & Carre, 1962), while no such effect seems to be caused by
this arrangement in achievement tests (e.g. Brenner, 1964; Smouse
& Munz, 1968). However, the test constructor is also concerned "with
difficulty in a psychological sense as it effects the morale or behaviour
of the test taker" (Myers, 1962). Now, if an item arrangement
from easy to hard based on the "p" index already has a positive effect on
performance (in connection with aptitude tests), an item difficulty sequence
based on the subject's perception of difficulty would seem even more
appropriate, as it is likely to increase morale, test motivation
and the like as well. Thus, the "p" index or similar derivatives seem to
be inferior to measures of perceived difficulty for the purpose of
arranging item sequence in a test. Going back to the results of the present
investigation, this would mean that the order of items in the test should be
rearranged according to the estimates and furthermore that certain items
would have to be replaced by new ones or omitted (provided that this
would not affect the test's reliability) if we wish to increase perceived
difficulty e.g. linearly with the item sequence. In this connection also
the slope of the regression line seems to be of interest. However,
experimental evidence is needed to confirm the above reasoning - a
challenging task for future research.

Result (2) is interesting from a psychological point of view. The
difference in the relative increase of perceived difficulty between the "best"
and the "worst" subjects might be due to several factors - above all to
the simple fact that the worst subjects "did not know what they were
judging", in other words, that the ability to estimate the difficulty of the
tasks adequately depends on the ability to solve them. This might depend
on the possible fact pointed out by Borg (1966), that an ungifted person
seems to find it easier to accept a wrong solution, and thus consider
a task relatively easier. From the theoretical point of view, a high
variation in estimates in a psychophysical experiment may have different
causes. It might show actual differences in perception as well as
differences in the ability to use numbers; this is an old and unsolved problem
in psychophysical methodology (cf., e.g., Ekman, 1966; Ekman &
Sjöberg, 1965).



Another instance which might illuminate the mechanism at work is 
that the "best" subjects chose, in Procedure 2, in most cases the last 
task C^A") as standard, which probably means that they had recognized 
it as the most difficult one; this was not the case with the "worst"
subjects, who might have been influenced by some secondary factors. This
leads us to the question of the "genuineness" of the estimates of
difficulty (Borg et al., 1970). There are several factors by which the
judgements of difficulty of the tasks in an intelligence test of Raven's type
might be contaminated, particularly if the perceived difficulty is
estimated in addition, after having tried to solve the items. There is a
possibility that the estimates were contaminated by the perception of
time spent to solve the individual tasks, as was the case in one of our
previous experiments (Bratfisch et al., 1970). A positive relationship
between solution time and estimates of difficulty was also noticed in the 
present investigation. Another alternative is that the estimates of diffi- 
culty were influenced by purely perceptual characteristics (especially 
by the complexity) of the items. The effect of the so-called information
feed-back (i. e. of the subject's knowledge of the successful solution of 
the tasks) should also be taken into account. The present data, however, 
do not yield enough information for a more thorough analysis of the 
above questions. Under all circumstances we feel that further studies 
should investigate the possibility of obtaining a fast, time saving (though 
probably rough) measure of a person's performance level on a test, just 
by having him estimate the difficulty of the test items (or some of them 
or the test as a whole). 

Result (3), finally, is (apart from methodological questions
concerning scaling procedures) interesting in the light of item selection when
constructing a test. It has been argued that dispersion measurements,
only being available for estimates of perceived difficulty and not for
objective measurements of difficulty based on performance, could be
important for test constructing purposes (Borg et al., op. cit.).
Furthermore it has been said that a certain degree of dispersion is necessary,
but tasks with too great dispersions are probably not suitable either (cf.
Borg, 1966). The results of the present study indicate that the degree of
dispersion depends on the scaling technique applied and it seems, thus,
that nominal measurements of dispersion would not be suitable for test
constructing purposes. Nevertheless, the problem might be solved by
using "relative" dispersion measurements, that is, relative to the
general trend of dispersion which is obtained when plotting them against
the corresponding means. A markedly higher (or extremely low)
inter-individual variability (a too big or extremely small dispersion) than could
be expected due to the scaling procedure applied could be used as an
item-selection criterion. The present results, however, did not show
an agreement between the two procedures used in this respect, i.e.
items deviating markedly from the general trend in Fig. 2 A did not
correspond to those deviating markedly from the general trend in Fig.
2 B. Yet, this interesting and for test construction purposes highly
important question is worth a thorough study in further investigations.
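
One way of making the idea of "relative" dispersion concrete is sketched below: a trend is fitted to the standard deviations as a function of the means, and items whose standard deviation departs strongly from that trend are flagged. The sketch assumes a linear trend, as found for Procedure 1; for data of the Procedure 2 type a curved trend would have to be substituted. The threshold of two residual standard deviations, the function name, and the data are hypothetical choices made for illustration, not a criterion proposed in the report.

```python
from math import sqrt


def flag_by_relative_dispersion(means, sds, threshold=2.0):
    """Indices of items whose SD deviates markedly from a linear SD-on-mean trend.

    Assumptions (illustrative only): the trend is a least-squares line, and an
    item is flagged when its residual exceeds `threshold` residual standard
    deviations in either direction.
    """
    n = len(means)
    if n != len(sds) or n < 3:
        raise ValueError("need at least three (mean, SD) pairs of equal length")
    mean_x = sum(means) / n
    mean_y = sum(sds) / n
    sxx = sum((x - mean_x) ** 2 for x in means)
    if sxx == 0:
        raise ValueError("means must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(means, sds)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(means, sds)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = sqrt(sum(r * r for r in residuals) / (n - 2))
    if resid_sd == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) / resid_sd > threshold]


# Hypothetical item means and SDs; the fifth item has an unusually large SD.
item_means = [float(m) for m in range(1, 11)]
item_sds = [0.5 * m for m in item_means]
item_sds[4] = 6.0
print(flag_by_relative_dispersion(item_means, item_sds))
```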